ISOLATED SHARED MEMORY ARCHITECTURE (iSMA)

ABSTRACT

Techniques for a massively parallel and memory centric computing system. The system has a plurality of processing units operably coupled to each other through one or more communication channels. Each of the plurality of processing units has an ISMn interface device. Each of the plurality of ISMn interface devices is coupled to an ISMe endpoint connected to each of the processing units. The system has a plurality of DRAM or Flash memories configured in a disaggregated architecture and one or more switch nodes operably coupling the plurality of DRAM or Flash memories in the disaggregated architecture. The system has a plurality of high speed optical cables configured to communicate at a transmission rate of 100 G or greater to facilitate communication from any one of the plurality of processing units to any one of the plurality of DRAM or Flash memories.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of and claims priority to U.S. application Ser. No. 14/194,574, filed Feb. 28, 2014, which claims priority to and is a continuation-in-part of U.S. patent application Ser. No. 14/187,082, filed on Feb. 21, 2014, which is a non-provisional of U.S. Provisional Application No. 61/781,928, filed on Mar. 14, 2013, which are incorporated by reference in their entirety.

BACKGROUND OF THE INVENTION

The present invention is directed to computing systems and methods. These computing systems can be applied to communications networks and the like.

Over the last few decades, the use of communication networks has exploded. In the early days of the Internet, popular applications were limited to email, bulletin boards, and mostly informational, text-based web page surfing, and the amount of data transferred was usually relatively small. Today, Internet and mobile applications demand a huge amount of bandwidth for transferring photo, video, music, and other multimedia files. For example, a social network like Facebook processes more than 500 TB of data daily. With such high demands on data and data transfer, existing data communication systems need to be improved to address these needs.

CMOS technology is commonly used to design communication systems implementing optical fiber links. As CMOS technology is scaled down to make circuits and systems run at higher speed and occupy smaller chip (die) area, the operating supply voltage is reduced for lower power. Conventional FET transistors in deep-submicron CMOS processes have very low breakdown voltage; as a result, the operating supply voltage is maintained around 1 Volt. These limitations pose significant challenges to the continued scaling and performance improvement of communication systems.

There have been many types of communication systems and methods. Unfortunately, they have been inadequate for various applications. Therefore, improved computing/communication systems and methods are desired.

BRIEF SUMMARY OF THE INVENTION

According to the present invention, techniques are directed to computing systems and methods. Additionally, various embodiments enable separate computer systems having such memory systems to send and receive data to and from other memory systems having such auxiliary interfaces.

In an example, the present invention provides a massively parallel and memory centric computing system. The system has a plurality of processing units operably coupled to each other through one or more communication channels. Each of the plurality of processing units has an ISMn interface device. Each of the plurality of ISMn interface devices is coupled to an ISMe endpoint connected to each of the processing units. The system has a plurality of DRAM or Flash memories configured in a disaggregated architecture and one or more switch nodes operably coupling the plurality of DRAM or Flash memories in the disaggregated architecture. The system has a plurality of high speed optical cables configured to communicate at a transmission rate of 100 G or greater to facilitate communication from any one of the plurality of processing units to any one of the plurality of DRAM or Flash memories.

Many benefits are recognized through various embodiments of the present invention. Such benefits include an architecture exhibiting superior power efficiency and in-memory computing efficiency. This architecture can involve disaggregating a large pool of memory (NAND flash or DRAM) that is shared amongst multiple CPU server nodes. Another benefit is a low-latency and high-bandwidth interconnect architecture amongst multiple CPU server nodes. Those of ordinary skill in the art will recognize that the mechanisms described can be applied to other communication systems as well.

The present invention achieves these benefits and others in the context of known memory technology. These and other features, aspects, and advantages of the present invention will become better understood with reference to the following description, figures, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The following diagrams are merely examples, which should not unduly limit the scope of the claims herein. One of ordinary skill in the art would recognize many other variations, modifications, and alternatives. It is also understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this process and scope of the appended claims.

FIG. 1 is a simplified architecture of a shared memory system according to an embodiment of the present invention.

FIG. 2 is a simplified architecture of an in-memory computing system according to an embodiment of the present invention.

FIG. 3 is a table with information regarding the computing systems according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

According to the present invention, techniques are directed to computing systems and methods. Additionally, various embodiments enable separate computer systems having such memory systems to send and receive data to and from other memory systems having such auxiliary interfaces.

The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

The reader's attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise, and counterclockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.

This invention describes an architecture for disaggregating a large pool of memory (NAND flash or DRAM) that is shared amongst multiple CPU server nodes. Another aspect of this invention is a low-latency and high-bandwidth interconnect architecture amongst multiple CPU server nodes. The notion of disaggregating storage, memory, and IO devices from monolithic designs is gaining importance and is being driven by the following considerations:

Much of today's hardware is highly monolithic in that our CPUs are inextricably linked to our motherboards, which in turn are linked to specific networking technology, IO, storage, and memory devices. This leads to poorly configured systems that cannot adapt to evolving software and waste a great deal of energy and material. Disaggregation is a way to break these monolithic designs.

Disaggregation allows independent replacement or upgrade of various disaggregated components. This reduces upgrade costs, as opposed to the increased costs due to gratuitous upgrade of components in monolithic designs.

FIG. 1 illustrates a block diagram of a 64-server node iSMA. The iSMA comprises two components: 1) a PCI-Express endpoint device called iSMe and 2) a switching node called iSMn. Each of the server CPUs has a local DRAM (not shown in FIG. 1) and is connected to a PCIe endpoint iSMe. The iSMe components of each server node connect to one of the iSMn switch nodes. Attached to each iSMn node is a plurality of DRAM memory channels (shown in FIG. 1) or flash memory devices (not shown in FIG. 1). All of the iSMn nodes are interconnected, thereby forming a shared memory interconnect fabric.

The following describes a mode of operation of the iSMA according to an embodiment of the present invention. Upon power-on or system boot, each of the iSMn nodes discovers the locally attached DRAM or flash memory capacity. The iSMn nodes broadcast amongst each of the connected nodes the DRAM/flash memory capacity and the topology information. After a settling time, all of the iSMn nodes learn the topology as well as the sum-total memory capacity information. The topology information comprises the number of connected server CPUs and identification of the server CPUs connected to the iSMn nodes.
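As a purely illustrative aid, the following C sketch shows one way an iSMn node might fold the boot-time broadcast announcements from its peers into a local topology table and a sum-total capacity figure. The structure layouts, field names, and example values are assumptions made for illustration; they are not part of the disclosed protocol.

```c
/*
 * Hypothetical sketch: how an iSMn switch node might accumulate the
 * capacity/topology announcements broadcast by its peers at boot.
 * All type and field names are illustrative, not taken from the specification.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_ISMN_NODES 64

struct ismn_announcement {
    uint16_t node_id;          /* which iSMn node sent this */
    uint16_t num_server_cpus;  /* server CPUs attached through iSMe endpoints */
    uint64_t local_mem_bytes;  /* DRAM/flash capacity attached to that node */
};

struct ismn_topology {
    uint64_t sum_total_bytes;                /* size of the disaggregated pool */
    uint16_t cpus_per_node[MAX_ISMN_NODES];  /* per-node server CPU count */
    int      nodes_seen;
};

/* Fold one broadcast announcement into the local view of the topology. */
static void ismn_learn(struct ismn_topology *t,
                       const struct ismn_announcement *a)
{
    t->sum_total_bytes += a->local_mem_bytes;
    t->cpus_per_node[a->node_id] = a->num_server_cpus;
    t->nodes_seen++;
}

int main(void)
{
    struct ismn_topology topo = {0};
    struct ismn_announcement msgs[] = {
        { .node_id = 0, .num_server_cpus = 8, .local_mem_bytes = 1ULL << 40 },
        { .node_id = 1, .num_server_cpus = 8, .local_mem_bytes = 1ULL << 40 },
    };

    for (unsigned i = 0; i < sizeof msgs / sizeof msgs[0]; i++)
        ismn_learn(&topo, &msgs[i]);

    printf("nodes=%d total=%llu bytes\n",
           topo.nodes_seen, (unsigned long long)topo.sum_total_bytes);
    return 0;
}
```

After the settling time described above, every node would hold an equivalent copy of such a table.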

The iSMn nodes communicate the topology and memory capacity information to the iSMe endpoints via upstream transactions. The iSMe nodes subsequently communicate this topology and sum-total memory capacity information to their respective server CPUs during PCIe enumeration. In particular, the sum-total memory capacity information is reported to the respective server CPU as an address range in a PCIe endpoint base address register (BAR).

The reporting of the sum-total memory through a BAR allows each of the server CPUs to have a common address view of the disaggregated memory. Also, the BAR range reporting of the disaggregated memory allows mapping of the physical address range of the disaggregated memory into a common virtual address range, thereby allowing caching of virtual-to-physical address translations of the disaggregated memory via the translation look-aside buffers in the server CPUs.
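The following host-side C sketch illustrates the idea of mapping the BAR-reported range into one agreed-upon virtual address range. It is a minimal sketch that assumes a Linux host exposing the iSMe endpoint's BAR0 through sysfs; the PCI address, BAR index, and fixed virtual base are illustrative assumptions, not values from the specification.

```c
/*
 * Hypothetical host-side sketch: mapping the BAR that reports the
 * disaggregated pool into a fixed virtual address range on a Linux host.
 * The sysfs path, BAR index, and fixed base address are assumptions.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define ISM_POOL_VBASE ((void *)0x7f0000000000ULL) /* agreed common VA (assumption) */

int main(void)
{
    const char *bar = "/sys/bus/pci/devices/0000:03:00.0/resource0";
    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    /* Map the whole BAR-reported range at the same virtual base on every
     * server so all CPUs share one view of the disaggregated memory. */
    void *pool = mmap(ISM_POOL_VBASE, (size_t)st.st_size,
                      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                      fd, 0);
    if (pool == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint64_t *p = pool;
    p[0] = 0xABCDULL;              /* touch the first word of remote memory */
    printf("pool mapped at %p, size %lld bytes\n", pool, (long long)st.st_size);

    munmap(pool, (size_t)st.st_size);
    close(fd);
    return 0;
}
```

Because every server agrees on the same virtual base, virtual-to-physical translations for the pool can be cached in each CPU's TLB exactly as described above.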

The visibility of the disaggregated memory as a common virtual address simplifies programming models. Also, sharing of this common pool of disaggregated memory by the server CPUs is decided through software convention and is influenced by the application use case models.

In an example, the iSMn nodes also have processing capability to perform data transformation operations on the locally connected memory. The server CPUs, through posted-write transactions or through executable programs downloaded into disaggregated memory, communicate the nature of the data transformation. The iSMn nodes, with their local processing capability, act upon these posted transactions or executable programs to perform data transformation operations. These data transformation operations are often called in-memory computations.
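To make the posted-write mechanism concrete, the following C sketch shows a hypothetical command descriptor that a server CPU could post to an iSMn node to request a transformation of a region of the disaggregated pool. The descriptor layout, opcode values, and command-queue representation are illustrative assumptions.

```c
/*
 * Hypothetical sketch of a posted-write command asking an iSMn node to run a
 * data transformation on its locally attached memory. Layouts and opcodes
 * are illustrative assumptions, not the disclosed command set.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum ismn_opcode { ISMN_OP_FILL = 1, ISMN_OP_XOR_SCAN = 2 };

struct ismn_cmd {
    uint32_t opcode;     /* which transformation to run */
    uint32_t flags;
    uint64_t pool_addr;  /* start address within the disaggregated pool */
    uint64_t length;     /* bytes to transform */
    uint64_t operand;    /* e.g. fill pattern or XOR key */
};

/* Post a command by copying the descriptor into the (memory-mapped) command
 * queue of the iSMn node; the device would then process it asynchronously. */
static void ismn_post(volatile void *cmd_queue, const struct ismn_cmd *c)
{
    memcpy((void *)cmd_queue, c, sizeof *c);   /* posted write: no reply awaited */
}

int main(void)
{
    static uint8_t fake_queue[64];             /* stand-in for the mapped BAR region */
    struct ismn_cmd c = {
        .opcode    = ISMN_OP_FILL,
        .pool_addr = 0x100000,
        .length    = 1 << 20,
        .operand   = 0,
    };
    ismn_post(fake_queue, &c);
    printf("posted opcode %u for %llu bytes\n",
           (unsigned)c.opcode, (unsigned long long)c.length);
    return 0;
}
```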

FIG. 2 is a simplified architecture of an in-memory computing system according to an embodiment of the present invention. In-memory compute capability allows the server CPUs to off-load, to the iSMn nodes, various data transformation operations on large pools of data stored in disaggregated memory. This offloading is beneficial in both performance and energy for large data-set payloads that exhibit poor cache locality; moving computation closer to the memory results in both power and performance efficiency. FIG. 2 illustrates the in-memory compute idea for an 8-server-node configuration connected via iSMe endpoints to a single iSMn node.

FIG. 3 is a table with information regarding the computing systems according to an embodiment of the present invention. To demonstrate the efficiency of the in-memory compute capability of the iSMn nodes, we estimated the performance improvement on the GUPS benchmark. The table of FIG. 3 illustrates the performance improvement estimates. The estimates demonstrate that two to three orders of magnitude of performance improvement can be obtained by offloading data transformation operations to disaggregated memory for the GUPS class of applications.
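For context, the C sketch below shows a GUPS-style random-update loop of the kind measured by the HPC Challenge RandomAccess benchmark, run on the host as a baseline. Each update touches an effectively random word of a large table, so cache and TLB locality are poor; this is the class of loop that, per the estimates above, could instead be executed by an iSMn node next to the disaggregated memory. The table size and update count are illustrative.

```c
/*
 * Sketch of a GUPS-style random-update kernel, run host-side as a baseline.
 * Poor cache locality in this loop is what makes it a candidate for
 * offloading to in-memory compute at the iSMn nodes.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define POLY 0x0000000000000007ULL   /* feedback polynomial used by HPCC RandomAccess */

int main(void)
{
    const size_t table_size = (size_t)1 << 24;   /* 16 Mi 64-bit words (128 MiB) */
    const size_t n_updates  = 4 * table_size;
    uint64_t *table = malloc(table_size * sizeof *table);
    if (!table) return 1;

    for (size_t i = 0; i < table_size; i++)
        table[i] = i;

    uint64_t ran = 1;
    for (size_t i = 0; i < n_updates; i++) {
        /* pseudo-random stream; each step lands on an unrelated cache line */
        ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
        table[ran & (table_size - 1)] ^= ran;
    }

    printf("done: %zu updates, table[0]=%llu\n",
           n_updates, (unsigned long long)table[0]);
    free(table);
    return 0;
}
```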

In various embodiments, a memory buffer as described herein could be implemented as a single integrated circuit (IC), or with a multiple-chip chipset with various functions spread among several ICs. For example, a memory system based on the DDR4 standard employs DIMMs which include nine separate data buffer chips arranged close to the connector contacts that provide an interface between the connector and the DRAMs. The standard also provides for a central control element which functions as the register section of the DIMM and includes an extra interface to control the data buffers. For this type of chipset implementation, implementing an auxiliary port as described herein requires a new path from the data buffers to the central controller.

In an embodiment, the present invention can include a massively parallel and memory centric computing system. This system can include a plurality of processing units operably coupled to each other through one or more communication channels. Each of the plurality of processing units can have an ISMn (Isolated Shared Memory network) interface device. Each of the plurality of ISMn interface devices can be coupled to an ISMe (Isolated Shared Memory endpoint) device connected to each of the processing units. Each of the plurality of processing units can be numbered from 1 through N, where N is an integer greater than or equal to 32. Each of these processing units can be an ARM or an Intel-based x86 processor.

In a specific embodiment, the system can be configured to initiate a power-on or system boot. Each of the iSMn interface devices can be configured to determine a capacity of any one or all of the plurality of DRAM or Flash memories. Each of the iSMn interface devices can be configured to communicate in a broadcast process among any other iSMn interface device, each of which can be coupled to at least one of the plurality of DRAM or Flash memories. This broadcast process can be provided to determine a capacity and a topology of any or all of the system, including the plurality of DRAM or Flash memories or networking configuration. The topology can include information selected from at least one of a number of connected processing units and identification information of the processing units connected to the iSMn devices.

Also, each of the iSMn devices can be configured to initiate communication of the topology and capacity information to any one or all of the iSMe devices using a communication direction from the iSMn devices to the iSMe devices. Each of the iSMe devices can be configured to thereafter communicate the topology and a collective capacity of a sum-total of the capacity to a particular processing unit during a PCIe enumeration process. The sum-total memory capacity information can be transferred to a particular processing unit as an address range in a PCIe endpoint base address register.

The transferring of the sum-total memory capacity can be provided using a base address register (BAR) characterized by allowing each of the processing units to have a common address view of the disaggregated memory. The BAR range reporting of the disaggregated memory can allow mapping of a physical address range of the disaggregated memory into a common virtual address range. This can allow caching of a virtual-to-physical address translation of the disaggregated memory provided by a translation look-aside buffer in the processing unit. The common address view of the disaggregated memory can be configured as a common virtual address.

A plurality of DRAM or Flash memories can be configured in a disaggregated architecture. One or more switch nodes can be operably coupled to the plurality of DRAM or Flash memories in the disaggregated architecture. Also, a plurality of high speed optical cables can be configured to communicate at a transmission rate of 100 G or greater to facilitate communication from any one of the plurality of processing units to any one of the plurality of DRAM or Flash memories. Each of the plurality of high speed optical cables can have a length of 1 meter to about 10 kilometers. The transmission rate can be 100 G PAM or another protocol.

The embodiments shown in the figures and described above are merely exemplary. The present system encompasses any memory system which employs a memory buffer that serves as an interface between the individual memory chips on a DIMM and a host, and which includes at least one additional, auxiliary interface which enables the memory buffer to serve as an interface between the host and/or memory chips and additional external devices.

In other embodiments, a system may include more than one host computer (each with a host controller), wherein each host computer includes a memory buffer having a RAM interface and an auxiliary interface, as described herein. The auxiliary interface of the memory buffer of one host computer may be directly coupled to an auxiliary interface of the memory buffer of another host computer, or may be coupled via one or more switches. As described herein, such configurations enable the transfer of data from one RAM to another RAM while bypassing the data paths of the host controllers.
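As a purely illustrative aid, the following C sketch shows the kind of copy request that software might hand to its local memory buffer to move data to another host's RAM over the auxiliary interfaces, bypassing both host controllers. The structure, field names, and the stub submission function are hypothetical and are not defined by the specification.

```c
/*
 * Hypothetical sketch of a RAM-to-RAM copy request carried over the auxiliary
 * interfaces described above, so data moves between the memory buffers of two
 * hosts without traversing either host controller. Names are illustrative.
 */
#include <stdint.h>
#include <stdio.h>

struct aux_copy_req {
    uint16_t src_host;     /* memory buffer/host that owns the source RAM */
    uint16_t dst_host;     /* memory buffer/host that owns the destination RAM */
    uint64_t src_offset;   /* byte offset within the source RAM */
    uint64_t dst_offset;   /* byte offset within the destination RAM */
    uint64_t length;       /* bytes to move over the auxiliary link or switch */
};

/* In a real system this would be written to the local memory buffer, which
 * forwards it over its auxiliary port; here we only print the request. */
static void aux_submit(const struct aux_copy_req *r)
{
    printf("copy %llu bytes: host %u +0x%llx -> host %u +0x%llx\n",
           (unsigned long long)r->length,
           (unsigned)r->src_host, (unsigned long long)r->src_offset,
           (unsigned)r->dst_host, (unsigned long long)r->dst_offset);
}

int main(void)
{
    struct aux_copy_req req = {
        .src_host = 0, .dst_host = 1,
        .src_offset = 0x1000, .dst_offset = 0x2000,
        .length = 4096,
    };
    aux_submit(&req);
    return 0;
}
```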

Various example embodiments are described with reference to the accompanying drawings, in which embodiments are shown. This inventive concept may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure is thorough and complete, and fully conveys the scope of the inventive concept to those skilled in the art. Like reference numerals refer to like elements throughout this application.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the inventive concept. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there may be no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., “between” versus “directly between,” “adjacent” versus “directly adjacent,” etc.).

The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting of the inventive concept. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this inventive concept belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

While the above is a full description of the specific embodiments, various modifications, alternative constructions, and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention, which is defined by the appended claims.

What is claimed is:
1. A massively parallel and memory centric computing system, the system comprising: an ISMn (Isolated Shared Memory network) interface device provided in each of a plurality of processing units, each of the plurality of processing units operably coupled to each other through at least a communication channel, the ISMn interface device being coupled to an ISMe (Isolated Shared Memory end point) device connected to each of the processing units; a disaggregated architecture comprising a plurality of DRAM or Flash memories configured in the disaggregated architecture; a switch node operably coupling the plurality of DRAM or Flash memories in the disaggregated architecture; and a high speed optical cable configured to communicate at a transmission rate of 100 G or greater to facilitate communication from any one of the plurality of processing units to any one of the plurality of DRAM or Flash memories.
2. The system of claim 1 wherein the plurality of high speed optical cables have a length of 1 meter to about 10 kilometers.
3. The system of claim 1 wherein the transmission rate is 100 G PAM or other protocol.
4. The system of claim 1 wherein the plurality of processing units is numbered from 1 through N, where N is an integer greater than or equal to thirty two.
5. The system of claim 1 wherein each of the processing units is either an ARM or an Intel based x86 processor.
6. The system of claim 1 wherein the system is configured to initiate a power on or system boot, the iSMn interface devices being configured to determine a capacity of any one or all of the plurality of DRAM or Flash memories.
7. The system of claim 1 wherein the iSMn interface devices are configured to communicate in a broadcast process among any other iSMn interface device, each of which is coupled to at least one of the plurality of DRAM or Flash memories; whereupon the broadcast process is provided to determine a capacity and a topology of any or all of the system including the plurality of DRAM or Flash memories or networking configuration.
8. The system of claim 7 wherein the topology comprises information selected from at least one of a number of connected processing units and identification information of the processing units to the iSMn devices.
9. The system of claim 8 wherein the iSMn device is configured to initiate communication of the topology and capacity information to the iSMe device using a communication direction from the iSMn device to the iSMe device.
10. The system of claim 9 wherein the iSMe device is configured to thereafter communicate the topology and a collective capacity of a sum-total of the capacity to a particular processing unit during a PCIe enumeration process.
11. The system of claim 10 wherein the sum-total memory capacity information is transferred to a particular processing unit as an address range in a PCIe endpoint base address register.
12. The system of claim 11 wherein transferring of the sum-total memory capacity is provided using a base address register (BAR) characterized by allowing each of the processing units to have a common address view of the disaggregated memory.
13. The system of claim 12 wherein the BAR range reporting of the disaggregated memory is configured to provide a mapping of a physical address range of the disaggregated memory into a common virtual address range, thereby configured to provide caching of a virtual to physical address translation of the disaggregated memory provided by a translation look-aside buffer in the processing unit.
14. The system of claim 13 wherein the common address view of the disaggregated memory is configured as a common virtual address.
15. A massively parallel and memory centric computing system, the system comprising: a plurality of processing units operably coupled to each other through a communication channel; an ISMe (Isolated Shared Memory endpoint) device coupled to each of the processing units; an ISMn (Isolated Shared Memory network) interface device coupled to each of the ISMe devices; a disaggregated architecture comprising a plurality of DRAM or Flash memories configured in the disaggregated architecture and coupled to the plurality of iSMn interface devices; a switch node operably coupling the plurality of DRAM or Flash memories in the disaggregated architecture; and a plurality of high speed optical cables configured to communicate at a transmission rate of 100 G or greater to facilitate communication from any one of the plurality of processing units to any one of the plurality of DRAM or Flash memories.
16. The system of claim 15 wherein each of the iSMn interface devices is configured to communicate in a broadcast process among any other iSMn interface device, each of which is coupled to at least one of the plurality of DRAM or Flash memories; whereupon the broadcast process is provided to determine a capacity and a topology of any or all of the system including the plurality of DRAM or Flash memories or networking configuration.
17. The system of claim 16 wherein the topology comprises information selected from at least one of a number of connected processing units and identification information of the processing units to the iSMn devices; and wherein each of the iSMn devices is configured to initiate communication of the topology and capacity information to any one or all of the iSMe devices using a communication direction from the iSMn devices to the iSMe devices.
18. The system of claim 17 wherein each of the iSMe devices is configured to thereafter communicate the topology and a collective capacity of a sum-total of the capacity to a particular processing unit during a PCIe enumeration process; and wherein the sum-total memory capacity information is transferred to a particular processing unit as an address range in a PCIe endpoint base address register.
19. The system of claim 18 wherein transferring of the sum-total memory capacity is provided using a base address register (BAR) characterized by allowing each of the processing units to have a common address view of the disaggregated memory; and wherein the BAR range reporting of the disaggregated memory is configured to provide a mapping of a physical address range of the disaggregated memory into a common virtual address range, thereby configured to provide caching of a virtual to physical address translation of the disaggregated memory provided by a translation look-aside buffer in the processing unit.
20. The system of claim 19 wherein the common address view of the disaggregated memory is configured as a common virtual address.